Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel ST_Union #698

Merged
merged 6 commits into from
Jul 3, 2022
Merged

parallel ST_Union #698

merged 6 commits into from
Jul 3, 2022

Conversation

mngr777
Copy link
Contributor

@mngr777 mngr777 commented Jun 16, 2022

Simple parallel ST_Union implementation: transfn just adds the values to a list, combinefn concatenates the lists from different
workers, and cascaded union is still fully calculated in finalfn. This doesn't affect execution time of ST_Union itself, but allows to
parallelize queries like SELECT ST_Union(ST_Buffer(...)) ....

@pramsey
Copy link
Member

pramsey commented Jun 16, 2022

Gotcha. That makes sense, I guess. The same behaviour would be available by going ST_Union(ST_Collect(ST_Buffer(...))) right? But this allowed the aggregate to "parallelize" and thus pick up efficiencies inside by default. +1

@mngr777
Copy link
Contributor Author

mngr777 commented Jun 16, 2022

@pramsey I get and error when trying this query, am I missing something?

# EXPLAIN SELECT ST_Union(ST_Collect(ST_Buffer(wkb_geometry, 0.001))) FROM usa;
ERROR:  aggregate function calls cannot be nested

@pramsey
Copy link
Member

pramsey commented Jun 16, 2022

Oh interesting, I guess we don't have a unary version of ST_Union (other than ST_UnaryUnion)

@pramsey
Copy link
Member

pramsey commented Jun 16, 2022

Maybe the same thing but like this would work: ST_Union(array_agg(ST_Buffer())) ... basically there's a version of union that takes an array.

@mngr777
Copy link
Contributor Author

mngr777 commented Jun 16, 2022

It looks like array_agg cannot be executed in parallel:

test=# SET parallel_setup_cost = 0;
test=# SET parallel_tuple_cost = 0;
test=# SET force_parallel_mode = on;
test=# SET min_parallel_table_scan_size = 0;
test=# SET max_parallel_workers_per_gather = 4;
test=# EXPLAIN SELECT ST_Union(array_agg(ST_Buffer(wkb_geometry, 0.1))) FROM usa_6;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Gather  (cost=82181.73..82206.74 rows=1 width=32)
   Workers Planned: 1
   Single Copy: true
   ->  Aggregate  (cost=82181.73..82206.74 rows=1 width=32)
         ->  Seq Scan on usa_6  (cost=0.00..723.58 rows=3258 width=16867)
(5 rows)

Grepping Postgres source, there are only array_agg_transfn and array_agg_finalfn functions.

@pramsey
Copy link
Member

pramsey commented Jun 16, 2022

That's very interesting. A little low-hanging core PgSQL performance fruit. PgSQL core devs don't believe that individual functions can ever have much cost, compared to i/o, so they leave this stuff lying around.

@kalenikaliaksandr
Copy link
Contributor

there is a patch that implements parallel support for array_agg that was written back in 2017 https://www.postgresql.org/message-id/flat/CAKJS1f9sx_6GTcvd6TMuZnNtCh0VhBzhX6FZqw17TgVFH-ga_A%40mail.gmail.com
but as I understood it was declined because it can break queries for users that rely on predictable order of array_agg.
looks like parallel array_agg will eventually get into pgsql after merging this patch https://commitfest.postgresql.org/38/3164/

@pramsey
Copy link
Member

pramsey commented Jun 16, 2022

Very subtle.

@kalenikaliaksandr
Copy link
Contributor

kalenikaliaksandr commented Jun 16, 2022

@pramsey I am wondering if possibility of allowing to specify custom allocator in GEOS was already discussed somewhere. Making GEOS to use palloc would unlock further improvement in parallel ST_Union.

postgis/lwgeom_union.c Outdated Show resolved Hide resolved
@mngr777
Copy link
Contributor Author

mngr777 commented Jun 21, 2022

Now that pgis_geometry_union_finalfn() function is not used, should I mark it somehow or just remove it?
If it can be removed, I can drop "_parallel_" from my function names.

@mngr777 mngr777 marked this pull request as ready for review June 21, 2022 08:40
@Komzpa
Copy link
Member

Komzpa commented Jun 21, 2022

Anything that was ever exposed via SQL API that needs to be removed goes to https://github.com/postgis/postgis/blob/master/postgis/postgis_legacy.c#L25 - you have to keep the signature for flawless binary-only system upgrades.

@robe2
Copy link
Member

robe2 commented Jun 21, 2022

The github action failure can be ignored. I think that is something to do with GDAL change.

The drone failure looks like a real issue because of missing drop of long deprecated function

/drone/src/regress/core/regress .. failed (diff expected obtained: /tmp/pgis_reg/test_46_diff)

-----------------------------------------------------------------------------

--- /drone/src/regress/core/regress_expected	2022-06-21 13:15:29.792734953 +0000

+++ /tmp/pgis_reg/test_46_out	2022-06-21 13:28:50.395571117 +0000

@@ -231,3 +231,4 @@

 314|676

 315|0

 316|t|LINESTRING(40 8.660254037844387,35 8.660254037844387)

+unexpected probin|pgis_geometry_union_finalfn:$libdir/postgis-2.5-uninstalled

-----------------------------------------------------------------------------

@mngr777
Copy link
Contributor Author

mngr777 commented Jun 21, 2022

GDAL issue is already fixed: OSGeo/gdal#5946, just need to rebuild test environment.

@strk
Copy link
Member

strk commented Jun 22, 2022 via email

@mngr777
Copy link
Contributor Author

mngr777 commented Jun 23, 2022

  • pgis_geometry_union_finalfn is now dropped in postgis/postgis_after_upgrade.sql.

  • hook-before-upgrade.sql and hook-after-upgrade.sql already have test for views using ST_Union aggregate (the version with a single argument):

    -- We know upgrading with an st_union() based view

  • I upgraded from 2.5.5 on Postgres 14 and checked ST_Union aggregate has been updated by running

SELECT p.proname, a.aggserialfn, a.aggdeserialfn, a.aggcombinefn, a.aggfinalfn FROM pg_proc AS p JOIN pg_aggregate as a ON a.aggfnoid = p.oid WHERE p.prokind='a' AND p.proname LIKE 'st_union'

and

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET force_parallel_mode = on;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT ST_Union(ST_Buffer(wkb_geometry, 0.001)[, 1]) FROM usa_6
  • postgis_proc_upgrade.pl adds SQL for ST_Union(geom) and ST_Union(geom, gridSize) aggregates into postgis_upgrade.sql.in:
-- Aggregate ST_Union(geometry) -- LastUpdated: 303
DO LANGUAGE 'plpgsql'
$postgis_proc_upgrade$
BEGIN
<...>
END
$postgis_proc_upgrade$;
-- Aggregate ST_Union (geometry,gridSize float8) -- LastUpdated: 303
DO LANGUAGE 'plpgsql'
$postgis_proc_upgrade$
BEGIN
<...>

@strk strk merged commit 17b7cda into postgis:master Jul 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants